
Infer filesystem from path when writing a partitioned DataFrame to remote file systems using pyarrow #34842


Closed (wants to merge 12 commits)

Conversation


@kylase kylase commented Jun 17, 2020

Infer the file system to be passed to pyarrow based on the path provided.
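The inference this PR relies on can be sketched roughly as follows. This is a minimal stand-in for pandas' internal `get_fs_for_path` helper: the function name `infer_filesystem` and the s3fs usage are illustrative, not the actual implementation.

```python
from urllib.parse import urlparse

def infer_filesystem(path):
    """Return an s3fs filesystem for s3:// paths, else None (local)."""
    if urlparse(path).scheme == "s3":
        # Optional dependency; only imported when an S3 path is seen.
        import s3fs
        return s3fs.S3FileSystem()
    return None
```

pyarrow's `write_to_dataset` treats `filesystem=None` as the local filesystem, so returning `None` for plain paths preserves the existing behaviour for local writes.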


pep8speaks commented Jun 17, 2020

Hello @kylase! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-18 09:04:33 UTC

@simonjayhawkins simonjayhawkins added the IO Parquet parquet, feather label Jun 17, 2020
Contributor

@TomAugspurger TomAugspurger left a comment


Thanks. We'll need tests and a whatsnew for this.

@@ -104,19 +104,26 @@ def write(
from_pandas_kwargs["preserve_index"] = index

table = self.api.Table.from_pandas(df, **from_pandas_kwargs)

fs = get_fs_for_path(path)
Contributor


This feels like it should be done at a higher level in the base class and passed through.

Contributor


Call it filesystem.

Author


FastParquetImpl doesn't require this, as it uses get_file_and_filesystem to obtain the filesystem in its read method.

fs is already used in the read method of PyArrowImpl, so for consistency fs is the better choice.

# write_to_dataset does not support a file-like object when
# a directory path is used, so just pass the path string.
if partition_cols is not None:
self.api.parquet.write_to_dataset(
table,
path,
compression=compression,
filesystem=fs,
Contributor


What if the user provides a filesystem?

Author


Thanks for catching this. I realised this after looking at the failing tests.

Perhaps filesystem should be an explicit keyword argument instead of being taken from kwargs?

I also think we need to update the docs to mention that filesystem is required when the system can't obtain the underlying credentials from e.g. ~/.aws/credentials.

@kylase kylase marked this pull request as draft June 17, 2020 12:13
@kylase kylase marked this pull request as ready for review June 17, 2020 17:24
@kylase kylase requested a review from TomAugspurger June 17, 2020 17:24
@kylase kylase changed the title Pass filesystem to parquet.write_table and parquet.write_to_dataset Infer filesystem from path when writing a partitioned DataFrame to remote file systems Jun 18, 2020
@kylase kylase changed the title Infer filesystem from path when writing a partitioned DataFrame to remote file systems Infer filesystem from path when writing a partitioned DataFrame to remote file systems using pyarrow Jun 18, 2020
@@ -568,6 +568,24 @@ def test_s3_roundtrip_for_dir(self, df_compat, s3_resource, pa, partition_col):
repeat=1,
)

@td.skip_if_no("s3fs")
@pytest.mark.parametrize("partition_col", [["A"], []])
def test_s3_roundtrip_for_dir_infer_fs(
Member


This looks identical to https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_parquet.py - could we parametrize instead here?

@@ -104,19 +104,23 @@ def write(
from_pandas_kwargs["preserve_index"] = index

table = self.api.Table.from_pandas(df, **from_pandas_kwargs)

fs = kwargs.pop("filesystem", get_fs_for_path(path))
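The `kwargs.pop` pattern above means an explicitly supplied filesystem always wins over inference. Stripped of pandas internals (inference stubbed with a string here), the precedence logic is just:

```python
def resolve_filesystem(path, **kwargs):
    # A user-supplied "filesystem" kwarg takes precedence; otherwise
    # fall back to inference from the path (stubbed as a string here).
    return kwargs.pop("filesystem", "inferred-for:" + path)

print(resolve_filesystem("s3://bucket/data"))                      # inferred-for:s3://bucket/data
print(resolve_filesystem("s3://bucket/data", filesystem="my_fs"))  # my_fs
```

One subtlety of the reviewed line: `dict.pop`'s default is an ordinary argument, so `get_fs_for_path(path)` is evaluated eagerly even when the user did pass a filesystem.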
Member


Perhaps we should wait until https://github.com/pandas-dev/pandas/pull/34266/files#diff-0d7b5a2c72b4dfc11d80afe159d45ff8L153 is merged, since this method is changing. @TomAugspurger might have thoughts.

Author

@kylase kylase Jun 20, 2020


I agree that #34266 is a more robust solution, which should land in v1.1.

This PR could be a patch for 1.0.x, as to_parquet with partition_cols specified does not work as designed. Without partition_cols it works as intended, with no need to specify the file system explicitly.

Member


We originally backported this functionality to 1.0.x, but it was then reverted in #34632, so there is currently no S3 directory functionality on 1.0.x. See the whatsnew: https://pandas.pydata.org/docs/whatsnew/v1.0.5.html#fixed-regressions. So I think this will have to wait for 1.1.

Author


Ok, sounds good. I will alter this after #34266 is merged.

@jorisvandenbossche
Member

#34266 has been merged in the meantime. @kylase if you have time to revisit this?

Author

kylase commented Jul 15, 2020

> #34266 has been merged in the meantime. @kylase if you have time to revisit this?

Will take a look at it over the weekend. From a quick glance at the change, I think it should be fixed, but we should add a test to make sure the file system is inferred when just the path is given.

@alimcmaster1
Member

> #34266 has been merged in the meantime. @kylase if you have time to revisit this?

> Will take a look at it over the weekend. From a quick glance at the change, I think it should be fixed, but we should add a test to make sure the file system is inferred when just the path is given.

@kylase are you still interested in working on this? The whatsnew will be 1.2 now.

Thanks

Author

kylase commented Aug 21, 2020

> #34266 has been merged in the meantime. @kylase if you have time to revisit this?

> Will take a look at it over the weekend. From a quick glance at the change, I think it should be fixed, but we should add a test to make sure the file system is inferred when just the path is given.

> @kylase are you still interested in working on this? The whatsnew will be 1.2 now.
>
> Thanks

I have looked at the tests in 1.1, and there are tests for both an explicit filesystem and an inferred one, so I feel the issue has been resolved.

I will proceed to close this PR and the issue.

@kylase kylase closed this Aug 21, 2020
@kylase kylase deleted the fix/parquet-write-dataset-s3 branch August 21, 2020 15:58
Labels
IO Parquet parquet, feather
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: to_parquet writes a partitioned DataFrame to the local filesystem instead of other filesystems (e.g. S3)
6 participants